Abstract

Given a set $X$ and a function $h : X \to \{0,1\}$ which labels each element of $X$
with either 0 or 1, we may define a function $h^{(s)}$ to measure the similarity of pairs of
points in $X$ according to $h$. Specifically, for $h \in \{0,1\}^X$ we define
$h^{(s)} \in \{0,1\}^{X \times X}$ by $h^{(s)}(w,x) := \mathbf{1}[h(w) = h(x)]$. This idea can be
extended to a set of functions, or hypothesis space $\mathcal{H} \subseteq \{0,1\}^X$, by defining a
similarity hypothesis space $\mathcal{H}^{(s)} := \{h^{(s)} : h \in \mathcal{H}\}$. We show that
$\text{vc-dimension}(\mathcal{H}^{(s)}) \in \Theta(\text{vc-dimension}(\mathcal{H}))$.

1 Introduction 

Consider the problem of learning from examples. We may learn by receiving class labels 
as feedback: ‘this is a dog’, ‘that is a wolf’, ‘there is a cat’, etc. We may also learn 
by receiving similarity labels: ‘these are the same’, ‘those are different’ and so forth. 
In this note we study the problem of learning with similarity versus class labels. Our 
approach is to use the VC-dimension [VC71] to study the fundamental difficulty of this 
learning task. 

In the supervised learning model we are given a training set of patterns and associated 
labels. The goal is then to find a hypothesis function that maps patterns to labels 
that will predict with few errors on future data (small generalization error). A classic 
approach to this problem is empirical risk minimisation. Here the procedure is to choose 
a hypothesis from a set of hypothesis functions (hypothesis space) that ‘fits’ the data as 
closely as possible. If the hypothesis is from a hypothesis space with small VC-dimension 
and fits the data well then we are likely to predict well on future data [VC71, BEHW89]. 
The number of examples required to have small generalisation error with high probability 
is called the sample complexity. In the uniform learnability model the VC-dimension 
gives nearly matching upper and lower bounds on the sample complexity [BEHW89, 
EHKV89]. In Theorem 1 we demonstrate that the VC-dimension of a hypothesis space 
with respect to similarity-labels is proportionally bounded by the VC-dimension with 
respect to class-labels, indicating that the sample complexities within the two feedback 
settings are comparable. That is, the fundamental difficulties of the two learning tasks 
are comparable. 

Related work 

We are motivated by the results of [GHP13]. Here the authors considered the problem of 
similarity prediction in the online mistake bound model [Lit88]. In [GHP13, Theorem 1] 
it was found that given a basic algorithm for class-label prediction with a mistake bound 
there exists an algorithm for similarity-label prediction with a mistake bound which was 
larger by no more than a constant factor. In this work we find an analogous result in 
terms of the VC-dimension. 


2 The VC-dimension of similarity hypothesis spaces 


A hypothesis space $\mathcal{H} \subseteq \{0,1\}^X$ is a set of functions from some set of patterns $X$ 
to the set of labels $Y = \{0,1\}$ in the two-class setting. The restriction of a function 
$h \in \{0,1\}^X$ to a subset $X' \subseteq X$ is the function $h|_{X'} \in \{0,1\}^{X'}$ with $h|_{X'}(x) := h(x)$ 
for each $x \in X'$. Analogously, one can define the restriction of a hypothesis space as 
$\mathcal{H}|_{X'} := \{h|_{X'} : h \in \mathcal{H}\}$. 

A subset $X' \subseteq X$ is said to be shattered by $\mathcal{H}$ if $\mathcal{H}|_{X'} = \{0,1\}^{X'}$, that is if the 
restriction contains all possible functions from $X'$ to $\{0,1\}$. The VC-dimension [VC71] 
of a hypothesis space $\mathcal{H} \subseteq \{0,1\}^X$, denoted $d(\mathcal{H})$, is the size of the largest subset of $X$ 
which is shattered by $\mathcal{H}$, that is 

$$d(\mathcal{H}) := \max_{X' \subseteq X} \{|X'| : \mathcal{H}|_{X'} = \{0,1\}^{X'}\}.$$
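
For finite $X$ and finite $\mathcal{H}$, the definitions of restriction, shattering and VC-dimension can be evaluated directly by enumeration. The following is a minimal sketch under that assumption; the function names (`restrict`, `is_shattered`, `vc_dimension`) and the representation of a hypothesis as a Python dict from points to labels are illustrative choices, not part of the original text.

```python
from itertools import combinations


def restrict(h, subset):
    """Restriction h|_{X'}: the labels h assigns to the points of `subset`."""
    return tuple(h[x] for x in subset)


def is_shattered(hypotheses, subset):
    """True if the restrictions realise all 2^|subset| labellings of `subset`."""
    patterns = {restrict(h, subset) for h in hypotheses}
    return len(patterns) == 2 ** len(subset)


def vc_dimension(hypotheses, domain):
    """Size of the largest shattered subset of `domain`, by brute force."""
    best = 0
    for size in range(1, len(domain) + 1):
        if any(is_shattered(hypotheses, s) for s in combinations(domain, size)):
            best = size
        else:
            # Every subset of a shattered set is shattered, so if no set of
            # this size is shattered then no larger set can be either.
            break
    return best


# Threshold functions h_t(x) = 1[x >= t] on four points (an assumed toy example).
domain = [0, 1, 2, 3]
thresholds = [{x: int(x >= t) for x in domain} for t in range(5)]
print(vc_dimension(thresholds, domain))
```

The search is exponential in $|X|$, so it is only meant for checking small examples by hand.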

Sauer’s lemma [VC71, Sau72, She72], which gives a lower bound for the VC-dimension 
of a hypothesis space, will be used for proving our main result. It states that for a 
hypothesis space $\mathcal{H} \subseteq \{0,1\}^X$, if 

$$|\mathcal{H}| > \sum_{i=0}^{m-1} \binom{|X|}{i} \tag{1}$$

then $d(\mathcal{H}) \ge m$. 

Given a function $h : X \to \{0,1\}$, we may define a function $h^{(s)}$ to measure the 
similarity of pairs of points in $X$ according to $h$. Specifically, for $h \in \{0,1\}^X$ we define 
$h^{(s)} \in \{0,1\}^{X \times X}$ by $h^{(s)}(w,x) := \mathbf{1}[h(w) = h(x)]$, where $\mathbf{1}$ is the indicator function. 
This idea can be extended to a hypothesis space $\mathcal{H}$ by defining the similarity hypothesis 
space $\mathcal{H}^{(s)} := \{h^{(s)} : h \in \mathcal{H}\}$. We now give our central result. 

Theorem 1. Given a hypothesis space $\mathcal{H} \subseteq \{0,1\}^X$, 

$$d(\mathcal{H}) - 1 \le d(\mathcal{H}^{(s)}) \le \delta\, d(\mathcal{H}),$$

with $\delta = 4.55$. 
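
The statement can be sanity-checked numerically on small finite examples. The sketch below builds the similarity hypothesis space of randomly drawn hypothesis spaces over a four-point set and compares the two VC-dimensions by brute force. All function names, the dict representation, and the random test spaces are assumptions made for illustration. On domains this small the upper bound is far from binding, so the check is mainly informative for the lower bound.

```python
import random
from itertools import combinations, product


def vc_dimension(hypotheses, domain):
    """Brute-force VC-dimension of a finite hypothesis space on a finite domain."""
    d = 0
    for size in range(1, len(domain) + 1):
        shattered = any(
            len({tuple(h[x] for x in subset) for h in hypotheses}) == 2 ** size
            for subset in combinations(domain, size)
        )
        if not shattered:
            break  # if no set of this size is shattered, no larger set is
        d = size
    return d


def similarity_space(hypotheses, domain):
    """Map each h to h^(s)(w, x) = 1[h(w) = h(x)] on all ordered pairs."""
    pairs = [(w, x) for w in domain for x in domain]
    induced = [{(w, x): int(h[w] == h[x]) for (w, x) in pairs} for h in hypotheses]
    return induced, pairs


def dimensions(hypotheses, domain):
    """Return the pair (d(H), d(H^(s)))."""
    induced, pairs = similarity_space(hypotheses, domain)
    return vc_dimension(hypotheses, domain), vc_dimension(induced, pairs)


def random_space(domain, size, rng):
    """A random hypothesis space: `size` distinct functions from domain to {0,1}."""
    all_functions = [dict(zip(domain, bits))
                     for bits in product((0, 1), repeat=len(domain))]
    return rng.sample(all_functions, size)


rng = random.Random(0)
domain = [0, 1, 2, 3]
for _ in range(20):
    H = random_space(domain, rng.randint(1, 16), rng)
    d, d_s = dimensions(H, domain)
    assert d - 1 <= d_s <= 4.55 * d
```

Distinct hypotheses $h$ and $1 - h$ induce the same similarity function; the set comprehension in `vc_dimension` absorbs such duplicates automatically.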
Proof. For the left hand inequality, let $n := d(\mathcal{H})$ and pick a set $T = \{x_1, x_2, \ldots, x_n\}$ 
of size $n$ which is shattered by $\mathcal{H}$. Then let $T' = \{(x_1, x_2), (x_2, x_3), \ldots, (x_{n-1}, x_n)\}$. To 
demonstrate that $T'$ is shattered by $\mathcal{H}^{(s)}$, let $g \in \{0,1\}^{T'}$ be any mapping from $T'$ to 
$\{0,1\}$. Then since $T$ is shattered by $\mathcal{H}$ we may find a map $h \in \mathcal{H}$ with $h(x_1) = 0$ and 

$$h(x_{i+1}) = \begin{cases} h(x_i) & \text{if } g(x_i, x_{i+1}) = 1 \\ 1 - h(x_i) & \text{if } g(x_i, x_{i+1}) = 0 \end{cases}$$

for $i = 1, \ldots, n-1$. Observe that $g = h^{(s)}|_{T'}$. Since $g$ was chosen arbitrarily, we may 
conclude that $T'$ is indeed shattered by $\mathcal{H}^{(s)}$, and therefore $d(\mathcal{H}^{(s)}) \ge |T'| = d(\mathcal{H}) - 1$. 

For the right hand inequality, first let $M := d(\mathcal{H}^{(s)})$ and then pick a set 
$U = \{(w_1, x_1), (w_2, x_2), \ldots, (w_M, x_M)\}$ of size $M$ in $X \times X$ which is shattered by $\mathcal{H}^{(s)}$. Let 
$V = \{w_1, w_2, \ldots, w_M, x_1, x_2, \ldots, x_M\}$ and note that $|\mathcal{H}|_V| \ge |\mathcal{H}^{(s)}|_U| = 2^M$. This is 
because any two maps $h$ and $g$ which agree on $V$ will induce maps $h^{(s)}$ and $g^{(s)}$ which 
agree on $U$, so $\mathcal{H}^{(s)}|_U$ cannot possibly contain more maps than $\mathcal{H}|_V$. Using this fact, 
and applying Sauer’s Lemma (see (1)) to $\mathcal{H}|_V$, we see that if 

$$2^M > \sum_{i=0}^{m-1} \binom{|V|}{i}$$

then $d(\mathcal{H}) \ge d(\mathcal{H}|_V) \ge m$. 

Now note the following inequality (see e.g., [FG06, Lemma 16.19]), which bounds a 
sum of binomial coefficients: 

$$\sum_{i=0}^{\lfloor \epsilon n \rfloor} \binom{n}{i} \le 2^{n H(\epsilon)} \qquad (0 < \epsilon < 1/2), \tag{2}$$


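where $H(\epsilon) := \epsilon \log_2 \frac{1}{\epsilon} + (1 - \epsilon) \log_2 \frac{1}{1 - \epsilon}$ denotes the binary entropy function. If we set 
$m = 1 + \lfloor 2\epsilon M \rfloor$ for some $\epsilon < \frac{1}{2}$ such that $H(\epsilon) < \frac{1}{2}$, we have 

$$\sum_{i=0}^{m-1} \binom{|V|}{i} \le \sum_{i=0}^{\lfloor 2\epsilon M \rfloor} \binom{2M}{i} \le 2^{2M H(\epsilon)} < 2^M,$$

using (2) and that $|V| \le 2M$ from the definition of $V$. Thus Sauer’s lemma can be 
applied with the above value of $m$ and hence 

$$d(\mathcal{H}) \ge 1 + \lfloor 2\epsilon M \rfloor > 2\epsilon M = 2\epsilon\, d(\mathcal{H}^{(s)}),$$

as long as $H(\epsilon) < 1/2$. Observe that $\epsilon = 0.11$ satisfies this condition and thus we have 
that 

$$d(\mathcal{H}^{(s)}) \le 4.55\, d(\mathcal{H}).$$

□ 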
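
The numerical facts used in the last step are easy to check: that the binary entropy of 0.11 is below 1/2, that 1/(2 × 0.11) is below 4.55, and that the resulting binomial sum stays below $2^M$. The sketch below does this with exact integer arithmetic for the binomial sums. The helper names and the range of $M$ tested are illustrative assumptions, and the check takes $|V| = 2M$, the largest value $|V|$ can have.

```python
from math import comb, floor, log2


def binary_entropy(eps):
    """H(eps) = eps*log2(1/eps) + (1-eps)*log2(1/(1-eps)), for 0 < eps < 1."""
    return eps * log2(1 / eps) + (1 - eps) * log2(1 / (1 - eps))


def binomial_sum(n, top):
    """Sum of C(n, i) for i = 0, ..., top (exact integer arithmetic)."""
    return sum(comb(n, i) for i in range(top + 1))


def sauer_condition_holds(M, eps):
    """Check 2^M > sum_{i=0}^{m-1} C(2M, i) with m = 1 + floor(2*eps*M)."""
    m = 1 + floor(2 * eps * M)
    return binomial_sum(2 * M, m - 1) < 2 ** M


eps = 0.11
print(binary_entropy(eps))  # slightly below 0.5
print(1 / (2 * eps))        # the constant that is rounded up to 4.55

# The condition needed to apply Sauer's lemma, for a range of M.
assert all(sauer_condition_holds(M, eps) for M in range(1, 200))
```

Any $\epsilon$ with $H(\epsilon) < 1/2$ works in the argument; 0.11 is close to the largest such value with two decimal places, which is where the constant 4.55 comes from.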
3 Discussion 


In the following, we give a family of examples where the VC-dimension of the similarity 
hypothesis space is exactly twice that of the original space. We use the following notation 
for the set of the first $n$ natural numbers: $[n] := \{1, 2, \ldots, n\}$. 

Example 2. For the hypothesis space of $k$-sparse vectors, $\mathcal{H}_k := \{h \in \{0,1\}^{[n]} : \sum_{i=1}^{n} h(i) \le k\}$, 

$$d(\mathcal{H}_k) = k \quad \text{and} \quad d(\mathcal{H}_k^{(s)}) = 2k,$$

provided that $n \ge 2k + 1$. 

Proof. Let $X := [n]$. Firstly note that $d(\mathcal{H}_k) \ge k$, since any subset $T \subseteq X$ with $|T| \le k$ 
is shattered by $\mathcal{H}_k$. If $T' \subseteq X$ with $|T'| > k$ then $T'$ cannot possibly be shattered by $\mathcal{H}_k$ 
since there is no element in $\mathcal{H}_k$ that labels all elements of $T'$ as 1. Therefore $d(\mathcal{H}_k) = k$. 

To see that $d(\mathcal{H}_k^{(s)}) \ge 2k$, let $U = \{(x_1, x_2), (x_2, x_3), \ldots, (x_{2k}, x_{2k+1})\}$ for any distinct 
elements $x_1, x_2, \ldots, x_{2k+1} \in X$ and note that $|U| = 2k$. To show that $U$ is shattered by 
$\mathcal{H}_k^{(s)}$, let $g \in \{0,1\}^U$ be any function from $U$ to $\{0,1\}$. We need to find an $h \in \mathcal{H}_k$ such 
that $g = h^{(s)}|_U$. Two functions in $\{0,1\}^X$ which satisfy the condition $g = h^{(s)}|_U$ are $h_0$ 
and $h_1$ defined by $h_0(x_1) = 0$, $h_1(x_1) = 1$ and 

$$h_j(x_{i+1}) = \begin{cases} h_j(x_i) & \text{if } g(x_i, x_{i+1}) = 1 \\ 1 - h_j(x_i) & \text{if } g(x_i, x_{i+1}) = 0 \end{cases}$$

$$h_j(x) = 0 \quad \forall x \notin \{x_1, x_2, \ldots, x_{2k+1}\}$$

for $i = 1, \ldots, 2k$ and $j = 0, 1$. Observe that by construction, $h_0(x_i) + h_1(x_i) = 1$ for each 
$i = 1, \ldots, 2k+1$ and therefore 

$$\sum_{i=1}^{2k+1} h_0(x_i) + \sum_{i=1}^{2k+1} h_1(x_i) = \sum_{i=1}^{2k+1} [h_0(x_i) + h_1(x_i)] = 2k + 1.$$

This means that we must have $\sum_{i=1}^{2k+1} h_j(x_i) \le k$ for some $j$ and hence $h_j \in \mathcal{H}_k$ 
with $h_j^{(s)}|_U = g$. This proves that $d(\mathcal{H}_k^{(s)}) \ge 2k$. 
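
The construction of $h_0$ and $h_1$ along the chain $x_1, \ldots, x_{2k+1}$ can be written out directly. In the sketch below, `g` is a list whose entry at position $i$ is the required similarity label of the pair of consecutive chain points at positions $i$ and $i+1$; the function names are hypothetical and only the chain points are represented, since all other points are labelled 0.

```python
from itertools import product


def chain_labelling(g, first_label):
    """Labels along the chain: start at `first_label`, keep the label where
    g says 'same' (1) and flip it where g says 'different' (0)."""
    labels = [first_label]
    for same in g:
        labels.append(labels[-1] if same == 1 else 1 - labels[-1])
    return labels


def induced_similarity(labels):
    """Similarity labels induced on consecutive pairs of the chain."""
    return [int(a == b) for a, b in zip(labels, labels[1:])]


def sparse_realiser(g):
    """Of the two complementary labellings h_0 and h_1, return the one with
    fewer ones. With an odd number of chain points it has at most
    (len(g)) // 2 ones."""
    h0 = chain_labelling(g, 0)
    h1 = chain_labelling(g, 1)
    return min(h0, h1, key=sum)


# Exhaustive check for k = 2: every g on the 2k pairs is realised by a
# labelling of the 2k + 1 chain points with at most k ones.
k = 2
for g in product((0, 1), repeat=2 * k):
    h = sparse_realiser(list(g))
    assert induced_similarity(h) == list(g)
    assert sum(h) <= k
```

The same recursion with the first label fixed to 0 is the one used for the left hand inequality of Theorem 1; the only extra ingredient here is the pigeonhole step that picks the sparser of the two complementary labellings.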

Now suppose, for a contradiction, that $d(\mathcal{H}_k^{(s)}) > 2k$. Then there is some set 
$E = \{(u_1, v_1), (u_2, v_2), \ldots, (u_{2k+1}, v_{2k+1})\} \subseteq X \times X$ of size $2k + 1$ which is shattered 
by $\mathcal{H}_k^{(s)}$. Let $V := \{u_1, u_2, \ldots, u_{2k+1}, v_1, v_2, \ldots, v_{2k+1}\}$ (note that in general we do not 
necessarily have that $|V| = 4k + 2$ since the $u_i$ and $v_i$ need not all be distinct). 

Let $G$ be the graph with vertex set $V$ and edge set $E$. Observe that elements of 
$\mathcal{H}_k$ correspond to $\{0,1\}$-labellings of $V$ and that elements of $\mathcal{H}_k^{(s)}$ correspond to $\{0,1\}$-labellings 
of $E$. Since $E$ is shattered by $\mathcal{H}_k^{(s)}$, every labelling of $E$ is realisable as the 
induced map $h^{(s)}|_E$ of some $h \in \mathcal{H}_k$. 

Note that $G$ cannot contain a cycle since there is no labelling of $V$ which could induce 
a similarity labelling on a cycle in which exactly one edge is labelled 0 and the rest are 
labelled 1.¹ So the graph is a union of trees, also known as a ‘forest’. Note that in 
general the number of vertices in a forest is $|V| = |E| + r$, where $|E|$ is the number of 
edges and $r$ is the number of trees in the forest. In this case we have $|V| = 2k + 1 + r$. 

¹ Indeed, under any such labelling of $E$ any two vertices in the cycle are connected by two paths, one 
path containing exactly zero edges labelled with a 0 (implying that the two vertices are labelled the 
same) and one path containing exactly one edge labelled with a 0 (implying that the two vertices are 
labelled differently). 
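
The claim about cycles can be confirmed by enumeration: on a cycle, the number of edges labelled 0 by an induced similarity labelling is always even, so a labelling with exactly one 0-edge is never realised. A small sketch, with hypothetical function names and the cycle vertices indexed $0, \ldots, n-1$:

```python
from itertools import product


def cycle_edge_labels(vertex_labels):
    """Similarity labels induced on the edges of the cycle
    v_0 - v_1 - ... - v_{n-1} - v_0 by a {0,1}-labelling of its vertices."""
    n = len(vertex_labels)
    return tuple(
        int(vertex_labels[i] == vertex_labels[(i + 1) % n]) for i in range(n)
    )


def realisable_edge_labellings(n):
    """All edge labellings of an n-cycle induced by some vertex labelling."""
    return {cycle_edge_labels(v) for v in product((0, 1), repeat=n)}


for n in range(1, 8):
    realisable = realisable_edge_labellings(n)
    # Each 0-edge marks a change of vertex label; going once round the cycle
    # returns to the starting label, so the number of changes is even.
    assert all(labels.count(0) % 2 == 0 for labels in realisable)
    # Hence only 2^(n-1) of the 2^n edge labellings occur: no shattering.
    assert len(realisable) == 2 ** (n - 1)
```

The cases $n = 1$ and $n = 2$ cover a pair of the form $(x, x)$ and the two pairs $(w, x)$, $(x, w)$ respectively, which are likewise excluded from any shattered set.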
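
Now choose a labelling $g$, which labels the vertices of each connected component 
(tree) in $G$ according to the following rule: for each connected component $C$ in $G$, label 
$\lfloor |C|/2 \rfloor$ vertices $v \in C$ with a 1 and the remaining $\lceil |C|/2 \rceil$ with a 0, where $|C|$ is the 
number of vertices in $C$. Note that $g \notin \mathcal{H}_k$ since 

$$\sum_{v \in V} g(v) = \sum_{C} \sum_{v \in C} g(v) = \sum_{C} \left\lfloor \frac{|C|}{2} \right\rfloor \ge \sum_{C} \frac{|C| - 1}{2} = \frac{|V| - r}{2} = k + \frac{1}{2} > k.$$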


Consider the edge labelling $g^{(s)}|_E$. Since $E$ is shattered by $\mathcal{H}_k^{(s)}$ there must be some 
$h \in \mathcal{H}_k$ such that $h^{(s)}|_E = g^{(s)}|_E$. But this is not possible, for if it were, then in order 
for $h^{(s)}$ to agree with $g^{(s)}$ we would need $h|_C = g|_C$ or $h|_C = 1 - g|_C$ for each connected 
component $C$ in $G$. Swapping the labellings between 0 and 1 on one or more of the 
connected components cannot decrease the number of 1 labellings and thus 

$$\sum_{v \in V} h(v) \ge \sum_{v \in V} g(v) > k$$

so $h$ cannot be in $\mathcal{H}_k$. Thus we have found a labelling of $E$, namely $g^{(s)}|_E$, which cannot 
be in $\mathcal{H}_k^{(s)}|_E$. But this is a contradiction of our initial assumption that $E$ was shattered by 
$\mathcal{H}_k^{(s)}$. So we have proved that our assumption must have been incorrect and therefore 

$$d(\mathcal{H}_k^{(s)}) = 2k.$$


□ 
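
For small parameters the two equalities of Example 2 can be confirmed by exhaustive search. The sketch below enumerates the $k$-sparse vectors on $[n]$ and computes both VC-dimensions by brute force. The function names are illustrative assumptions. To keep the search small it only considers pairs $(w, x)$ with $w < x$: a pair $(x, x)$ is always labelled 1, and $(w, x)$ and $(x, w)$ always receive the same label, so neither kind can enlarge a shattered set.

```python
from itertools import combinations, product


def vc_dimension(hypotheses, domain):
    """Brute-force VC-dimension of a finite hypothesis space on a finite domain."""
    d = 0
    for size in range(1, len(domain) + 1):
        shattered = any(
            len({tuple(h[x] for x in subset) for h in hypotheses}) == 2 ** size
            for subset in combinations(domain, size)
        )
        if not shattered:
            break  # if no set of this size is shattered, no larger set is
        d = size
    return d


def k_sparse_space(n, k):
    """All h in {0,1}^[n] with at most k ones, as dicts over 1..n."""
    domain = list(range(1, n + 1))
    space = [dict(zip(domain, bits))
             for bits in product((0, 1), repeat=n) if sum(bits) <= k]
    return space, domain


def example_dimensions(n, k):
    """Return (d(H_k), d(H_k^(s))) for the k-sparse vectors on [n]."""
    space, domain = k_sparse_space(n, k)
    pairs = list(combinations(domain, 2))  # pairs (w, x) with w < x only
    induced = [{(w, x): int(h[w] == h[x]) for (w, x) in pairs} for h in space]
    return vc_dimension(space, domain), vc_dimension(induced, pairs)


for n, k in [(3, 1), (4, 1), (5, 2)]:
    d, d_s = example_dimensions(n, k)
    assert (d, d_s) == (k, 2 * k)
```

The search grows quickly with $n$, so this is a check of the statement for a few small cases satisfying $n \ge 2k + 1$, not a substitute for the proof.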


In Theorem 1 the lower bound $d(\mathcal{H}) - 1 \le d(\mathcal{H}^{(s)})$ is tight, for example when 
$\mathcal{H} = \{0,1\}^X$. However, observe that in Example 2, the hypothesis space of $k$-sparse 
vectors, the similarity space “expands” only by a factor of 2, which is less than the factor 
$\delta = 4.55$ of Theorem 1. We leave as a conjecture that the upper bound in Theorem 1 
can be improved to a factor of two. 